58 research outputs found

    Stack-less SIMT reconvergence at low cost

    Get PDF
    Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose a technique to handle SIMT control divergence that operates in constant space and handles indirect jumps and recursion. We describe a possible implementation which leverage the existing memory divergence management unit, ensuring a low hardware cost. In terms of performance, this solution is at least as efficient as existing techniques

    Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT

    Get PDF
    National audienceLes architectures parallèles qui suivent le modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchronisée. Cela nécessite de séparer les threads lorsqu'il prennent des chemins différents dans le graphe de contrôle de flot, et d'être à même de les re-synchroniser dès que possible de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article une technique pour traiter la divergence de contrôle en SIMT qui opère en espace constant et gère les sauts indirects et la récursivité. Nous décrivons une réalisation possible qui s'appuie sur le matériel existant de l'unité de gestion de la divergence mémoire, assurant un coût matériel très réduit. En termes de performance, cette solution est au moins aussi efficace que les techniques existantes

    Reconvergence de contrĂ´le implicite pour les architectures SIMT

    Get PDF
    National audienceParallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverage the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state of the art techniques in use in current GPUs.Les architectures parallèles qui obéissent au modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchrone. Lorsque les threads empruntent des chemins divergents dans le graphe de flot de contrôle, leur exécution est séquentialisée jusqu'au prochain point de convergence. La reconvergence doit être effectuée au plus tôt de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article deux techniques permettant de traiter la divergence de contrôle en SIMT et d'identifier dynamiquement les points de reconvergence, dont une qui opère en espace constant et gère les sauts indirects et la récursivité. Nous évaluons une réalisation matérielle consistant à partager le matériel existant de l'unité de gestion de la divergence mémoire. En termes de performances, cette solution est au moins aussi efficace que les techniques de l'état de l'art employés par les GPU actuels

    Assouplir les contraintes des architectures SIMT à faible coût

    Get PDF
    Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverage the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state of the art techniques in use in current GPUs.Les architectures parallèles qui obéissent au modèle SIMT telles que les GPU tirent parti de la régularité des applications en exécutant plusieurs threads concurrents sur des unités SIMD de manière synchrone. Lorsque les threads empruntent des chemins divergents dans le graphe de flot de contrôle, leur exécution est séquentialisée jusqu'au prochain point de convergence. La reconvergence doit être effectuée au plus tôt de manière à maximiser l'occupation des unités SIMD. Nous proposons dans cet article deux techniques permettant de traiter la divergence de contrôle en SIMT et d'identifier dynamiquement les points de reconvergence, dont une qui opère en espace constant et gère les sauts indirects et la récursivité. Nous évaluons une réalisation matérielle consistant à partager le matériel existant de l'unité de gestion de la divergence mémoire. En termes de performances, cette solution est au moins aussi efficace que les techniques de l'état de l'art employés par les GPU actuels

    Affine Vector Cache for memory bandwidth savings

    Get PDF
    Preserving memory locality is a major issue in highly-multithreaded architectures such as GPUs. These architectures hide latency by maintaining a large number of threads in flight. As each thread needs to maintain a private working set, all threads collectively put tremendous pressure on on-chip memory arrays, at significant cost in area and power. We show that thread-private data in GPU-like implicit SIMD architectures can be compressed by a factor up to 16 by taking advantage of correlations between values held by different threads. We propose the Affine Vector Cache, a compressed cache design that complements the first level cache. Evaluation by simulation on the SDK and Rodinia benchmarks shows that a 32KB L1 cache assisted by a 16KB AVC presents a 59% larger usable capacity on average compared to a single 48KB L1 cache. It results in a global performance increase of 5.7% along with an energy reduction of 11% for a negligible hardware cost

    Options for Denormal Representation in Logarithmic Arithmetic

    Get PDF
    International audienceEconomical hardware often uses a FiXed-point Number System (FXNS), whose constant absolute precision is acceptable for many signal-processing algorithms. The almost-constant relative precision of the more expensive Floating-Point (FP) number system simplifies design, for example, by eliminating worries about FXNS overflow because the range of FP is much larger than FXNS for the same wordsize; however, primitive FP introduces another problem: underflow. The conventional Signed Logarithmic Number System (SLNS) offers similar range and precision as FP with much better performance (in terms of power, speed and area) for multiplication, division, powers and roots. Moderate-precision addition in SLNS uses table lookup with properties similar to FP (including underflow). This paper proposes a new number system, called the Denormal LNS (DLNS), which is a hybrid of the properties of FXNS and SLNS. The inspiration for DLNS comes from the denormal (aka subnormal) numbers found in IEEE-754 (that provide better, gradual underflow) and the μ-law often used for speech encoding; the novel DLNS circuit here allows arithmetic to be performed directly on such encoded data. The proposed approach allows customizing the range in which gradual underflow occurs. A wide gradual underflow range acts like FXNS; a narrow one acts like SLNS. The DLNS approach is most affordable for applications involving addition, subtraction and multiplication by constants, such as the Fast Fourier Transform (FFT). Simulation of an FFT application illustrates a moderate gradual underflow decreasing bit-switching activity 15% compared to underflow-free SLNS, at the cost of increasing application error by 30%. DLNS reduces switching activity 5% to 20% more than an abruptly-underflowing SLNS with one-half the error. Synthesis shows the novel circuit primarily consists of traditional SLNS addition and subtraction tables, with additional datapaths that allow the novel ALU to act on conventional SLNS as well as DLNS and mixed data, for a worst-case area overhead of 26%. For similar range and precision, simulation of Taylor-series computations suggest subnormal values in DLNS behave similarly to those in the IEEE-754 FP standard. Unlike SLNS, DLNS approach is quite costly for general (non-constant) multiplication, division and roots. To overcome this difficulty, this paper proposes two variation called Denormal Mitchell LNS (DMLNS) and Denormal Offset Mitchell LNS (DOMLNS), in which the well-known Mitchell's method makes the cost of general multiplication, division and roots closer to that of SLNS. Taylor-series computations suggest subnormal values in DMLNS and DOMLNS also behave similarly to those in the IEEE-754 FP standard. Synthesis shows that DMLNS and DOMLNS respectively have average area overheads of 25% and 17% compared to an equivalent SLNS 5-operation unit.Les circuits intégrés économiques utilisent souvent des systèmes de numération en virgule fixe, dont la précision absolue constante est acceptable pour de nombreux algorithmes de traitement du signal. La précision relative quasi-constante du système virgule flottante, plus coûteux, simplifie la conception, en éliminant notamment le risque de débordement par le haut, la dynamique du flottant étant bien plus grande qu'en virgule fixe. Cependant, le flottant primitif induit un autre problème : le débordement par le bas (underflow). Le système logarithmique conventionnel (SLNS) offre une dynamique et une précision similaire au flottant, pour des performances bien meilleures (en termes de consommation, vitesse et surface) pour la multiplication, la division, les puissances et les racines. L'addition en précision moyenne en SLNS est basées sur des accès à des tables, avec des propriétés similaires au flottant (incluant le débordement par le bas). Cet article propose trois variations autour d'un nouveau système de représentation des nombres, respectivement appelées Denormal LNS (DLNS), Denormal Mitchell LNS (DMLNS) et Denormal Offset Mitchell LNS (DOMLNS), qui sont toutes des hybrides des propriétés de la virgule fixe et du SLNS. L'inspiration de D(OM)LNS vient des nombre dénormaux (ou sous-normaux) de la norme IEEE-754, qui fournissent un débordement par le bas graduel, et le codage µ-law utilisé dans la transmission de la voix. Le nouveau circuit DLNS proposé permet de calculer directement sur les données codées. L'approche proposée permet d'ajuster l'intervalle dans lequel le débordement progressif intervient. Une plage large se comporte comme la virgule fixe, une étroite comme le SLNS. L'approche DLNS est la plus économique pour les applications impliquant des additions, soustractions et multiplications par des constantes, telles que les transformées de Fourier rapides (FFT). Notre première mise en {\oe}uvre s'appuie sur les blocs de base existant d SLNS. Des synthèses montrent que le nouveau circuit est constitué principalement des tables d'additions SLNS traditionnelles, avec des chemins de données supplémentaires qui permettent à la nouvelle unité d'opérer sur des données SLNS, DLNS ou mixtes, pour un surcoût en surface de 26% dans le pire cas. Contrairement au SLNS, cette réalisation de DLNS reste coûteuse pour la multiplication générique, la division et les racines. Pour surmonter cette difficulté, cet article propose les variations DMLNS et DOMLNS, pour lesquelles la méthode de Mitchell rapproche le coût des multiplications génériques, divisions et racines de leurs équivalents en SLNS. Des calculs sur des séries de Taylor suggèrent que les valeurs sous-normales en DMLNS et DOMLNS se comportent également de manière similaires à celles de la norme IEEE-754. Des synthèses montrent que DMLNS et DOMLNS offrent des surcoûts respectifs de 25% et 17% par rapport à une unité SLNS à 5 opérations équivalente

    Simty: generalized SIMT execution on RISC-V

    Get PDF
    International audienceWe present Simty, a massively multi-threaded RISC-V processor core that acts as a proof of concept for dynamic inter-thread vector-ization at the micro-architecture level. Simty runs groups of scalar threads executing SPMD code in lockstep, and assembles SIMD instructions dynamically across threads. Unlike existing SIMD or SIMT processors like GPUs or vector processors, Simty vector-izes scalar general-purpose binaries. It does not involve any instruction set extension or compiler change. Simty is described in synthesizable RTL. A FPGA prototype validates its scaling up to 2048 threads per core with 32-wide SIMD units. Simty provides an open platform for research on GPU micro-architecture, on hybrid CPU-GPU micro-architecture, or on heterogeneous platforms with throughput-optimized cores

    Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

    Get PDF
    International audienceSingle-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the the same row of execution units, instead of one single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture

    Scheduling paths leveraging dynamic information in SIMT architectures

    Get PDF
    International audienceThread divergence optimization in GPU architectures have long been hindered by restrictive control-flow mechanisms based on stacks of execution masks. However, GPU architectures recently began implementing more flexible hardware mechanisms, presumably based on path tables. We leverage this opportunity by proposing a hardware implementation of iteration shifting, a divergence optimization that enables lockstep execution across arbitrary iterations of a loop. Although software implementations of iteration shifting have been previously proposed, implementing this scheduling technique in hardware lets us leverage dynamic information such as divergence patterns and memory stalls. Evaluation using simulation suggest that the expected performance improvements will remain modest or even nonexistent unless the organization of the memory access path is also revisited

    The Denormal Logarithmic Number System

    Get PDF
    International audienceEconomical hardware often uses a FiXed-point Number System (FXNS), whose constant absolute precision is acceptable for many signal-processing algorithms. The almost-constant relative precision of the more expensive Floating-Point (FP) number system simplifies design, for example, by eliminating worries about FXNS overflow because the range of FP is much larger than FXNS for the same wordsize; however, primitive FP introduces another problem: underflow. The conventional Signed Logarithmic Number System (SLNS) offers similar range and precision as FP with much better performance (in terms of power, speed and area) for multiplication, division, powers and roots. Moderate-precision addition in SLNS uses table lookup with properties similar to FP (including underflow). This paper proposes a new number system, called the Denormal LNS (DLNS), which is a hybrid of the properties of FXNS and SLNS. The inspiration for DLNS comes from the denormal numbers found in IEEE-754 (that provide better, gradual underflow) and the µ-law often used for speech encoding; the novel DLNS circuit here allows arithmetic to be performed directly on such encoded data. The proposed approach allows customizing the range in which gradual underflow occurs. A wide gradual underflow range acts like FXNS; a narrow one acts like SLNS. Simulation of an FFT application illustrates a moderate gradual underflow decreasing bit-switching activity 15% compared to underflow-free SLNS, at the cost of increasing application error by 30%. DLNS reduces switching activity 5% to 20% more than an abruptly-underflowing SLNS with one-half the error. Synthesis shows the novel circuit primarily consists of traditional SLNS addition and subtraction tables, with additional datapaths that allow the novel ALU to act on conventional SLNS as well as DLNS and mixed data, for a worst-case area overhead of 26%
    • …
    corecore